ABSTRACT

The objective of this study was to show that
standardized reading scores could be adequately estimated from scores
on a criterion-referenced test in reading. This would reduce
classroom test time while, at the same time, providing the kinds of
information teachers need to guide instruction and the kinds of
information administrators require for making decisions regarding
education programs. Stepwise regression and equipercentile equating
were used to estimate scores from the criterion-referenced scores.
The results show that it is possible to estimate normative scores
from a broad-based criterion-referenced test in reading. (Author)

In the literature of educational measurement, discussions frequently
appear on the differences between standardized, norm-referenced tests and
criterion-referenced tests. The differences are in their construction,
interpretation, and use. Norm-referenced achievement tests are constructed
to measure broad educational goals, and items are selected to discriminate
the amount of knowledge or skill a student has in a particular achievement
domain. Construction procedures tend to spread students out on a continuum
and bring out individual differences where they exist (with respect to
performance on the particular test). Criterion-referenced tests are constructed
to measure specific educational objectives, and items are selected to
discriminate between individuals who have or have not mastered these particular
objectives. Construction procedures tend to maximize the instructional
effects on scores rather than the individual differences of students.
Norm-referenced tests are useful for long-term evaluation of educational progress,
while criterion-referenced tests are useful for evaluating short-term
instruction and, therefore, for assisting teachers in diagnosing strengths and
weaknesses of students and planning their instruction.

Criterion-referenced test information is most useful to the classroom 
teacher in giving insight into, and guidance for, instruction. Such tests 
provide diagnostic and prescriptive information about each student that 
allows the teacher to plan instruction for groups and individual students 
best suited to meet their individual instructional needs. There is, however, 
a continuing demand and need by educational administrators, legislatures, and 
the general public for comparative or normative data on students in order to 
make intelligent decisions about the allocation of resources and in order to 
know how they stand with respect to national, state, or district performance. 



Time taken from instruction to administer standardized,
norm-referenced tests in the classroom, which have no direct effect on
instruction, is perceived by teachers and students as wasted time. If it
is possible to use the same instrument to provide teachers the kind of
information they need, that is, criterion-referenced information, and, at
the same time, provide administrators the kind of information they need,
that is, norm-referenced information, then the time and effort put into
testing by teachers and students alike will be perceived as useful and less
threatening to both.

It is possible, of course, to norm a criterion-referenced test or to
criterion-reference a norm-referenced test, thus using the same test for
both purposes. The consensus, however, seems to be that the differences
between the two kinds of tests are such that a criterion-referenced norm- 
referenced test will be a poor substitute for a well constructed criterion- 
referenced test and that a normed criterion-referenced test will be a poor 
substitute for a well constructed norm-referenced test (see Hambleton & 
Novick, 1973; Messick, 1974). 

We have been conducting studies to determine the relationships between 
the two types of tests and have found that by using regression analyses and 
equating techniques, a good, comprehensive criterion-referenced test can 
produce normative test results about as well as a norm-referenced test. It 
is interesting to note that our data show that the reverse is not true. 

METHOD 

The tests used in this study were the Reading Vocabulary, Reading
Comprehension, and Reading Total scores from the California Achievement
Tests, 1970 Edition (CAT-70), a well-known nationally normed achievement
series, and the Prescriptive Reading Inventory (PRI), a comprehensive
criterion-referenced test of reading skills measuring about ninety
objectives in four overlapping levels covering most of what is taught in reading
from Grades 1.5 through 6. Both tests are published by CTB/McGraw-Hill.



The data for this study were collected in the fall of 1972 as part of a
larger overall study of the PRI. The data collected were as follows:

           PRI      CAT-70    Ethnic    No. of
  Grade    Level    Level     Code      Cases

    1.5    A        1         Stand       555
  * 2.2    A        1         Stand       963
    2.2    B        1         Stand       685
    2.2    B        1         Black       935
  * 3.2    B        1         Black       916
  * 3.2    B        2         Stand       742
    3.2    C        2         Stand       615
  * 4.2    C        2         Stand       993
    4.2    D        2         Stand       539
  * 5.2    D        3         Stand      1498
    6.2    D        3         Stand      1773


The procedure was to select one grade/level combination for each level 
of the PRI plus an additional data set from the black sample at level B of 
the PRI for the regression analysis. The selected grade/level combinations 
are indicated by asterisks in the table above. For each of these cells 
(grade/level combinations), 70% of the available data was randomly selected
as the regression sample and the remaining (random) 30% was used for cross
validation. Using a stepwise regression program, weights for predicting the
three CAT-70 raw scores from PRI objectives scores were obtained. The weights 
were cross validated with the remaining 30% of the data in each of these 
cells and, as a more stringent test of validity, the same weights were 
validated using a random 30% of the data from adjacent grades where the 
same level of the PRI and CAT-70 had been administered. There were, then, 
five regression analyses and nine cross validations. Some additional 
analyses were done using CAT-70 standard scale scores, but these analyses 
will be described below. 
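
The procedure just described (a random 70/30 split, stepwise selection of predictors in the regression sample, and cross validation in the held-out sample) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the data are synthetic stand-ins for PRI objective scores and a CAT-70 raw score, the selection rule is a simple forward procedure with a hypothetical `min_gain` stopping threshold, and none of it reproduces the stepwise program actually used in the study.

```python
import numpy as np

def fit(X, y):
    """Ordinary least squares with an intercept; returns [b0, b1, ...]."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Apply an intercept-plus-weights vector to a matrix of predictors."""
    return coef[0] + X @ coef[1:]

def r_squared(X, y):
    """Squared multiple correlation of y with the columns of X."""
    resid = y - predict(fit(X, y), X)
    return 1.0 - resid.var() / y.var()

def forward_stepwise(X, y, min_gain=0.005):
    """Add, one at a time, the predictor that most increases R^2.

    Stops when the best remaining predictor adds less than min_gain
    (a hypothetical stopping rule chosen for this sketch).
    """
    chosen, best = [], 0.0
    while len(chosen) < X.shape[1]:
        rest = [j for j in range(X.shape[1]) if j not in chosen]
        gains = {j: r_squared(X[:, chosen + [j]], y) for j in rest}
        j_best = max(gains, key=gains.get)
        if gains[j_best] - best < min_gain:
            break
        chosen.append(j_best)
        best = gains[j_best]
    return chosen, fit(X[:, chosen], y)

# Synthetic stand-in data: 30 "objective scores" (0 to 4 items correct each)
# and a criterion raw score that depends on the first eight of them.
rng = np.random.default_rng(0)
n, p = 1000, 30
X = rng.integers(0, 5, size=(n, p)).astype(float)
true_w = np.array([3.0, 2.5, 2.0, 2.0, 1.5, 1.5, 1.0, 1.0])
y = X[:, :8] @ true_w + rng.normal(0.0, 3.0, size=n)

# Random 70% regression sample, remaining 30% for cross validation.
idx = rng.permutation(n)
cut = int(0.7 * n)
train, hold = idx[:cut], idx[cut:]

selected, coef = forward_stepwise(X[train], y[train])

# Cross validation: correlate predicted with obtained scores in the
# held-out sample and compare their means (actual minus predicted).
pred = predict(coef, X[hold][:, selected])
r_cv = np.corrcoef(pred, y[hold])[0, 1]
mean_diff = y[hold].mean() - pred.mean()
print(sorted(selected), round(r_cv, 3), round(mean_diff, 2))
```

The two quantities computed at the end correspond to the cross-validation correlations and mean differences discussed in the results; applying the same weights to a sample from an adjacent grade would follow the same last four lines with a different held-out sample.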

RESULTS

The results of the raw score regression analyses and cross validations
thus far described are summarized in Table 1. In reading this table, note
that the correlations under the regression analysis column are multiple
correlations from the regression analyses. The correlations under the
cross validation column are correlations between predicted and obtained
CAT-70 reading scores in the validation samples, and the correlations under
the alternate form CAT correlations column are the simple correlations
between the reading scores from Form A and Form B of CAT-70. This table
speaks pretty well for itself. The validity coefficients are quite high,
sometimes exceeding both the multiple correlations from the regression
analyses and the alternate form correlations. They are somewhat lower for
the black sample, particularly when computed on data from a different grade.
This may be a consequence of lower reliabilities for the Grade 2 black
scores. There are also some marginally large differences between the actual
CAT-70 means and the means of the predicted CAT scores when the weights are
applied across grades. This occurs for the black sample and at Level D.
The largest difference is -3.45 raw score points for Reading Total in the
black sample. This difference represents about 7 percentile points or 1 to
2 months in grade equivalent score. The differences in means for cross
validation at the same grade level are all less than one raw score point.
Overall, these data suggest that the predicted CAT-70 reading scores from
the PRI are about as good as an alternate form of CAT-70 itself.

Insert Table 1 about here

Scale Score Analysis 

Having proved to ourselves that predicting normative reading scores 
from the PRI was quite feasible and practical, we now wanted to obtain a 
single regression equation for each of the four levels of the PRI that would 
optimally predict CAT-70 scale scores, which are independent of the CAT 
level and from which derived scores (percentiles, grade equivalents) are 
easily obtainable. We also wanted to investigate further the equatability 
and scalability of the predicted scale scores. 

We first converted the CAT-70 raw scores to scale scores, then pooled 
all of the data for a given level of the PRI to rerun the regression analy- 
ses. For Level A this included first and second grade data, for Level B 
second and third grade data for both the standard and black samples, for 
Level C third and fourth grade data, and Level D fourth, fifth, and
sixth grade data. The four regression analyses were run and the weights
obtained were adjusted to give the same mean and standard deviation for the 
actual and predicted scale scores and these were then applied in turn to 
each of the groups making up the data pool for that level of the PRI. The 
results of these analyses are shown in Table 2. Note first that the corre- 
lations hold up nicely, as would be expected. There are some differences 
in actual means and the means of the predictions. These, however, are not 
serious. The largest difference is -6.3 scale score units which occurs in 
Reading Comprehension for the Grade 2 standard sample. This difference
represents slightly more than one raw score unit, which is 1 or 3 percentile
points or about one month in grade equivalent score.
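
The adjustment of the weights described above can be sketched as follows. Least-squares predictions have a smaller standard deviation than the scores they predict (in the regression sample, the standard deviation of the predictions equals the multiple correlation times the standard deviation of the actual scores), so some rescaling is needed if the two are to match. The paper does not say how the adjustment was carried out; the sketch assumes a simple linear rescaling of the weights and intercept, and the function name `adjust_weights` and the synthetic data are hypothetical.

```python
import numpy as np

def adjust_weights(intercept, weights, X, y):
    """Rescale regression weights so predictions match y in mean and SD.

    Assumed method: multiply every weight by SD(y) / SD(predicted) and
    shift the intercept so that the means agree.
    """
    pred = intercept + X @ weights
    scale = y.std() / pred.std()
    new_weights = weights * scale
    new_intercept = y.mean() - scale * (pred.mean() - intercept)
    return new_intercept, new_weights

# Synthetic stand-in for pooled objective scores and scale scores.
rng = np.random.default_rng(1)
X = rng.integers(0, 5, size=(500, 6)).astype(float)
y = 300.0 + X @ np.array([8.0, 6.0, 5.0, 4.0, 3.0, 2.0]) + rng.normal(0.0, 20.0, 500)

# Ordinary least-squares weights, then the adjusted weights.
A = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
b0, w = adjust_weights(coef[0], coef[1:], X, y)

raw_pred = coef[0] + X @ coef[1:]
adj_pred = b0 + X @ w
print(round(raw_pred.std(), 2), round(adj_pred.std(), 2), round(y.std(), 2))
```

Because the rescaling is linear, the correlation between predicted and actual scores is unchanged. The means and standard deviations agree only for the pooled data on which the adjustment is made; when the adjusted weights are applied to a single group within the pool, differences between actual and predicted means such as those in Table 2 can still appear.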

In addition to the regression analyses, we obtained distributions of 
the actual and predicted scale scores for each group and of the differences 
between them (actual minus predicted). These distributions show some 
interesting properties of the predicted scale scores. The distributions of 
obtained scale scores can have data only at particular points on the scale 
score continuum corresponding to particular raw scores. These points may 
be separated by two or three scale score points near the middle of the 
distribution or by twenty or more scale score points near the ends of the 
distribution; that is, for obtained scale scores, missing an item (or getting 
an additional item correct) may change the scale score obtained by as much 
as twenty or more points. The predicted scale scores are based on a weighted 
composite of 30 to 35 objective scores, each made up of three to five items.
For this reason, any scale score (including fractional ones) within the
range of the test is possible. A change in performance on one or several
items in the PRI will not change the predicted scale score very much.
Typically in these distributions, there are data at every scale score point
throughout the range of the test except at the ends of the distributions,
where the frequencies fall to zero abruptly. This effect may be due to
the fact that to obtain a very high scale score a student must pass a large
number of items, rather than just getting one or two more items correct than
the other students in the sample and, similarly, in order to get a very low
predicted scale score a student must fail a large number of items. Assuming
that the test is reasonably within the functional range for the students in
the sample, either of these events is unlikely. Figure 1 shows this effect
very nicely and is typical of all of the group distributions. In this
figure, the obtained and predicted scale scores are plotted against the
normal deviates from the distributions for one test and one grade. The over
prediction at the low end and the under prediction at the high end of the
distribution are clear.

The obtained scale score distribution forms practically a straight
line and these scores are approximately normally distributed. The
distribution of predicted scale scores is platykurtic and throughout most of the
range of the test is more like a uniform distribution than the normal. The
predicted scale scores rank order students very well, better than do the
obtained scale scores.

Summary statistics from the distributions of difference scores are
shown in Table 3. In looking at these statistics, recall that these
differences are between fixed points on the scale score continuum representing
corresponding raw score points and scores which range through all scale
score points within the range of the test. Also bear in mind the over and
under predictions at the low and high ends of the distributions, respectively.
Though most of the mean differences fall respectably close to zero, there is
considerable variation in the accuracy of predicting individual scores. The
standard deviations range from about 20 to 44 scale score points. Before
we at CTB attempt to make any predictions of individual scores, we will
attempt to improve accuracy by doing an equipercentile equating of the
distributions of obtained and predicted scale scores.

Insert Table 3 about here
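
The equipercentile equating proposed above can be illustrated with a minimal, unsmoothed sketch: each predicted scale score is replaced by the obtained scale score that has the same percentile rank. The function name `equipercentile_equate`, the mid-rank definition of percentile rank, and the synthetic distributions (a roughly normal obtained distribution and a roughly uniform predicted one, echoing the shapes described for Figure 1) are all assumptions for illustration; an operational equating would normally add smoothing and is not shown here.

```python
import numpy as np

def equipercentile_equate(predicted, obtained):
    """Map predicted scores onto the obtained-score scale by percentile rank."""
    predicted = np.asarray(predicted, dtype=float)
    obtained = np.asarray(obtained, dtype=float)
    ordered = np.sort(predicted)
    # Mid-rank percentile of each predicted score within its own distribution.
    below = np.searchsorted(ordered, predicted, side="left")
    at_or_below = np.searchsorted(ordered, predicted, side="right")
    pr = (below + at_or_below) / (2.0 * len(ordered))
    # Obtained score located at the same percentile rank.
    return np.quantile(obtained, pr)

# Synthetic stand-ins with equal mean (400) and SD (40) but different shapes.
rng = np.random.default_rng(2)
obtained = rng.normal(400.0, 40.0, size=2000)
half_width = 40.0 * np.sqrt(3.0)  # a uniform with this half-width has SD 40
predicted = rng.uniform(400.0 - half_width, 400.0 + half_width, size=2000)

equated = equipercentile_equate(predicted, obtained)
order = np.argsort(predicted)
print(round(equated.mean(), 1), round(equated.std(), 1))
```

The equated scores keep the rank order of the predicted scores while taking on the shape of the obtained distribution, which is the property that would address the over and under prediction at the ends of the distributions.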

Footnote

1. I wish to express my appreciation to Merrill E. Guest for his
assistance in preparing this paper. In particular, he prepared
Figure 1, which was most helpful in interpreting the distributional
data.

TABLE 2. Results of the Regression Analysis and Sub-Group
Validation Using Scale Scores and Adjusted Weights

                                            CAT-70    CAT-70    CAT-70
                                            Vocab.     Comp.     Total

PRI Level A

Alternate form correlations                   .860      .770      .855
Multiple correlations                         .841      .792      .858
Standard Error of Estimate                    21.0      32.7      22.0
Number of cases                               1415      1411      1407

Grade 1.5 correlations                        .823      .733      .834
Grade 1.5 N's                                  412       429       412
Grade 1.5 actual means                       328.0     325.1     313.2
Grade 1.5 mean of predictions                327.0     325.8     313.0
Grade 1.5 difference (act. - pred.)            1.0                  .2

Grade 2.2 correlations                        .854      .824      .877
Grade 2.2 N's                                  852       865       844
Grade 2.2 actual means                       332.7     335.3     320.3
Grade 2.2 mean of predictions                333.0     336.5     320.5
Grade 2.2 difference (act. - pred.)            -.3       -.8       -.2

PRI Level B

Alternate form correlations                   .828      .787      .858
Multiple correlations                         .846      .801      .862
Standard Error of Estimate                    27.1      37.5      29.3
Number of cases                               3308      3297      3292

Grade 2.2 standard sample correlations        .846      .792      .858
Grade 2.2 standard sample N's                  640       642       639
Grade 2.2 standard sample actual means       340.4     345.2     329.3
Grade 2.2 standard sample mean of pred.      338.2     351.5     329.2
Grade 2.2 standard sample diff.                2.2      -6.3        .1

Grade 2.2 black sample correlations           .708      .619      .719
Grade 2.2 black sample N's                     707       708       704
Grade 2.2 black sample actual means          296.2     306.1     281.3
Grade 2.2 black sample mean of pred.         296.5     303.5     283.1
Grade 2.2 black sample diff.                   -.3       2.6      -1.8

Grade 3.2 standard sample correlations        .774      .781      .817
Grade 3.2 standard sample N's                  801       800       799
Grade 3.2 standard sample actual means       370.1     393.8     369.2
Grade 3.2 standard sample mean of pred.      373.1     396.1     369.7
Grade 3.2 standard sample diff.               -3.0      -2.3       -.5

Grade 3.2 black sample correlations           .814      .774      .838
Grade 3.2 black sample N's                     846       826       831
Grade 3.2 black sample actual means          328.2     342.8     318.1
Grade 3.2 black sample mean of pred.         325.5     339.3     315.5
Grade 3.2 black sample diff.                   2.7       3.5       2.6

TABLE 2. (Continued)

                                            CAT-70    CAT-70    CAT-70
                                            Vocab.     Comp.     Total

PRI Level C

Alternate form correlations                   .816      .790      .858
Multiple correlations                         .802      .821      .861
Standard Error of Estimate                    30.1      36.8      30.2
Number of cases                               1566      1563      1562

Grade 3.2 correlations                        .805      .787      .845
Grade 3.2 N's                                  505       515       520
Grade 3.2 actual means                       369.3     393.8     367.9
Grade 3.2 mean of predictions                368.9     391.4     367.1
Grade 3.2 difference (act. - pred.)             .4       2.4        .8

Grade 4.2 correlations                        .782      .818      .853
Grade 4.2 N's                                  754       759       764
Grade 4.2 actual means                       395.9     427.4     401.3
Grade 4.2 mean of predictions                396.8     427.9     401.1
Grade 4.2 difference (act. - pred.)            -.9       -.5        .2

PRI Level D

Alternate form correlations                   .848      .837      .895
Multiple correlations                         .855      .834      .882
Standard Error of Estimate                    34.4      39.6      33.1
Number of cases                               3799      3796      3794

Grade 4.2 correlations                        .698      .741      .786
Grade 4.2 N's                                  590       587       587
Grade 4.2 actual means                       399.9     430.0     406.6
Grade 4.2 mean of predictions                402.4     428.3     405.9
Grade 4.2 difference (act. - pred.)           -2.5       1.7        .7

Grade 5.2 correlations                        .860      .834      .886
Grade 5.2 N's                                 1389      1375      1377
Grade 5.2 actual means                       426.7     452.5     429.7
Grade 5.2 mean of predictions                429.0     453.9     432.4
Grade 5.2 difference (act. - pred.)           -2.3      -1.4      -2.7

Grade 6.2 correlations                        .870      .838      .887
Grade 6.2 N's                                 1695      1684      1684
Grade 6.2 actual means                       459.8     482.5     462.8
Grade 6.2 mean of predictions                455.7     481.3     459.7
Grade 6.2 difference (act. - pred.)            4.1       1.2       3.1

TABLE 3. Summary Statistics from the Distribution of Scale Score Differences:
Obtained - Predicted.

                                              CAT-70         CAT-70         CAT-70
                                              Reading        Reading        Reading
                                           Vocabulary  Comprehension          Total

Grade 1.5, PRI A, CAT 1, Standard   Mean         1.13           -.37            .27
                                    S.D.        21.58          36.63          22.94
                                    N             431            431            431

Grade 2.2, PRI A, CAT 1, Standard   Mean         -.19          -1.36           -.28
                                    S.D.        20.34          31.70          20.91
                                    N             879            879            879

Grade 2.2, PRI B, CAT 1, Standard   Mean         2.14          -6.44            .01
                                    S.D.        25.79          37.91          28.31
                                    N             644            644            644

Grade 2.2, PRI B, CAT 1, Black      Mean         -.21           2.51          -1.70
                                    S.D.        28.25          39.60          30.62
                                    N             717            717            717

Grade 3.2, PRI B, CAT 1, Black      Mean         2.66           3.39           2.42
                                    S.D.        26.20          34.49          26.87
                                    N             843            843            843

Grade 3.2, PRI B, CAT 2, Standard   Mean        -2.90          -2.18           -.49
                                    S.D.        28.53          37.11          30.14
                                    N             803            803            803

Grade 3.2, PRI C, CAT 2, Standard   Mean          .19           2.37            .87
                                    S.D.        30.66          38.25          30.05
                                    N             532            532            532

Grade 4.2, PRI C, CAT 2, Standard   Mean         -.64           -.11            .48
                                    S.D.        31.75          37.01          31.71
                                    N             780            780            780

Grade 4.2, PRI D, CAT 2, Standard   Mean        -2.53           1.66            .53
                                    S.D.        42.05          44.18          39.09
                                    N             590            590            590

Grade 5.2, PRI D, CAT 3, Standard   Mean        -2.33          -1.46          -2.71
                                    S.D.        33.93          40.32          32.48
                                    N            1390           1390           1390

Grade 6.2, PRI D, CAT 3, Standard   Mean         4.04           1.11           3.17
                                    S.D.        32.73          39.67          32.21
                                    N            1697           1697           1697

Figure 1. Distribution of Obtained and Predicted
Reading Vocabulary Scale Scores for PRI
Level D, CAT Level 3, Grade 6.2.
(Plotted series: Obtained Scale Score against Deviate;
Estimated Scale Score against Deviate.)